Distributed Representations for Biological Sequence Analysis

Authors

  • Dhananjay Kimothi
  • Akshay Soni
  • Pravesh Biyani
  • James M. Hogan
Abstract

Biological sequence comparison is a key step in inferring the relatedness of various organisms and the functional similarity of their components. Thanks to Next Generation Sequencing efforts, an abundance of sequence data is now available to be processed for a range of bioinformatics applications. Embedding a biological sequence (over a nucleotide or amino acid alphabet) in a lower-dimensional vector space makes the data more amenable to current machine learning tools, provided the embedding is of high quality and captures the most meaningful information in the original sequences. Motivated by recent advances in the text document embedding literature, we present a new method, called seq2vec, to represent a complete biological sequence in a Euclidean space. The new representation has the potential to capture the contextual information of the original sequence necessary for sequence comparison tasks. We test our embeddings on protein sequence classification and retrieval tasks and demonstrate encouraging outcomes.

Similar resources

iProsite: an improved prosite database achieved by replacing ambiguous positions with more informative representations

The PROSITE database contains a set of entries corresponding to protein families, which are used to identify the family of a protein from its sequence. Although patterns and profiles are developed to be very selective, each may have false positive or negative hits. Considering false positives as items that reduce the selectiveness of a pattern, then, the more selective pattern we have, a more accur...

Strong convergence for variational inequalities and equilibrium problems and representations

We introduce an implicit method for finding a common element of the set of solutions of systems of equilibrium problems and the set of common fixed points of a sequence of nonexpansive mappings and a representation of nonexpansive mappings. Then we prove the strong convergence of the proposed implicit schemes to the unique solution of a variational inequality, which is the optimality condition for ...

Biological Activity Analysis of Native and Recombinant Streptokinase Using Clot Lysis and Chromogenic Substrate Assay

Determination of streptokinase activity is usually accomplished through two assay methods: a) Clot lysis, b) Chromogenic substrate assay. In this study the biological activity of two streptokinase products, namely Streptase®, which is a native product and Heberkinasa®, which is a recombinant product, was determined against the third international reference standard using the two forementioned a...

dna2vec: Consistent vector representations of variable-length k-mers

One of the ubiquitous representations of a long DNA sequence is its division into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, any pair of one-hot vectors is equidistant. This is particularly problematic when applying the latest machine learning algorithms to so...

Journal:
  • CoRR

Volume: abs/1608.05949

Publication date: 2016